Model and tokenizer
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--model-path, --model | None | Type: str | ✅ | ✅ |
--tokenizer-path | None | Type: str | ✅ | ✅ |
--tokenizer-mode | auto | auto, slow | ✅ | ✅ |
--tokenizer-worker-num | 1 | Type: int | ✅ | ✅ |
--skip-tokenizer-init | False | bool flag (set to enable) | ✅ | ✅ |
--load-format | auto | auto, safetensors | ✅ | ✅ |
--model-loader-extra-config | | Type: str | ✅ | ✅ |
--trust-remote-code | False | bool flag (set to enable) | ✅ | ✅ |
--context-length | None | Type: int | ✅ | ✅ |
--is-embedding | False | bool flag (set to enable) | ✅ | ✅ |
--enable-multimodal | None | bool flag (set to enable) | ✅ | ✅ |
--revision | None | Type: str | ✅ | ✅ |
--model-impl | auto | auto, sglang, transformers | ✅ | ✅ |
HTTP server
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--host | 127.0.0.1 | Type: str | ✅ | ✅ |
--port | 30000 | Type: int | ✅ | ✅ |
--skip-server-warmup | False | bool flag (set to enable) | ✅ | ✅ |
--warmups | None | Type: str | ✅ | ✅ |
--nccl-port | None | Type: int | ✅ | ✅ |
--fastapi-root-path | None | Type: str | ✅ | ✅ |
--grpc-mode | False | bool flag (set to enable) | ✅ | ✅ |
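
The Options column distinguishes valued options (`Type: str`, `Type: int`) from bool flags that are set to enable. As an illustration of how those two kinds map onto a command line, here is a minimal Python sketch. It assumes the server is started through the `python -m sglang.launch_server` entry point; the helper `build_argv` and the model path are hypothetical names introduced only for this example.

```python
import shlex
import sys


def build_argv(options):
    """Turn a {flag: value} mapping into an argv list (hypothetical helper)."""
    argv = [sys.executable, "-m", "sglang.launch_server"]
    for flag, value in options.items():
        if value is None or value is False:
            continue  # leave the option at its default
        argv.append(flag)
        if value is not True:  # bool flags take no value
            argv.append(str(value))
    return argv


argv = build_argv({
    "--model-path": "/models/example-model",  # placeholder path
    "--host": "127.0.0.1",
    "--port": 30000,
    "--skip-server-warmup": True,   # bool flag: present means enabled
    "--grpc-mode": False,           # bool flag left off
})
print(shlex.join(argv[1:]))
```

The resulting list can be handed to `subprocess.Popen` as-is; keeping it as a list avoids shell quoting problems.
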
Quantization and data type
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--dtype | auto | auto,float16,bfloat16 | ✅ | ✅ | ❌ |
--quantization | None | modelslim | ✅ | ✅ | ❌ |
--quantization-param-path | None | Type: str | ❌ | ❌ | ✅ |
--kv-cache-dtype | auto | auto | ✅ | ✅ | ❌ |
--enable-fp32-lm-head | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--modelopt-quant | None | Type: str | ❌ | ❌ | ✅ |
--modelopt-checkpoint-restore-path | None | Type: str | ❌ | ❌ | ✅ |
--modelopt-checkpoint-save-path | None | Type: str | ❌ | ❌ | ✅ |
--modelopt-export-path | None | Type: str | ❌ | ❌ | ✅ |
--quantize-and-serve | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
--rl-quant-profile | None | Type: str | ❌ | ❌ | ✅ |
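
The support columns can also be read mechanically. The sketch below parses a few rows copied from the table above and reports which requested flags are not marked ✅ in a chosen column. The function names `parse_rows` and `unsupported` are hypothetical, and the check only reflects what the table states.

```python
ROWS = """\
--dtype | auto | auto,float16,bfloat16 | ✅ | ✅ | ❌ |
--quantization | None | modelslim | ✅ | ✅ | ❌ |
--quantization-param-path | None | Type: str | ❌ | ❌ | ✅ |
--modelopt-quant | None | Type: str | ❌ | ❌ | ✅ |
"""
COLUMNS = ["Argument", "Defaults", "Options", "A2", "A3", "Special"]


def parse_rows(text):
    """Parse pipe-separated rows into {argument: {column: cell}}."""
    table = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        row = dict(zip(COLUMNS, cells))
        table[row["Argument"]] = row
    return table


def unsupported(flags, table, column="A2"):
    """Return the flags whose cell in `column` is not a check mark."""
    return [f for f in flags if table[f][column] != "✅"]


table = parse_rows(ROWS)
print(unsupported(["--dtype", "--modelopt-quant"], table))
```
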
Memory and scheduling
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--mem-fraction-static | None | Type: float | ✅ | ✅ |
--max-running-requests | None | Type: int | ✅ | ✅ |
--prefill-max-requests | None | Type: int | ✅ | ✅ |
--max-queued-requests | None | Type: int | ✅ | ✅ |
--max-total-tokens | None | Type: int | ✅ | ✅ |
--chunked-prefill-size | None | Type: int | ✅ | ✅ |
--max-prefill-tokens | 16384 | Type: int | ✅ | ✅ |
--schedule-policy | fcfs | lpm, fcfs | ✅ | ✅ |
--enable-priority-scheduling | False | bool flag (set to enable) | ✅ | ✅ |
--schedule-low-priority-values-first | False | bool flag (set to enable) | ✅ | ✅ |
--priority-scheduling-preemption-threshold | 10 | Type: int | ✅ | ✅ |
--schedule-conservativeness | 1.0 | Type: float | ✅ | ✅ |
--page-size | 128 | Type: int | ✅ | ✅ |
--swa-full-tokens-ratio | 0.8 | Type: float | ✅ | ✅ |
--disable-hybrid-swa-memory | False | bool flag (set to enable) | ✅ | ✅ |
--abort-on-priority-when-disabled | False | bool flag (set to enable) | ✅ | ✅ |
--enable-dynamic-chunking | False | bool flag (set to enable) | ✅ | ✅ |
Runtime options
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--device | None | Type: str | ✅ | ✅ |
--tensor-parallel-size, --tp-size | 1 | Type: int | ✅ | ✅ |
--pipeline-parallel-size, --pp-size | 1 | Type: int | ✅ | ✅ |
--pp-max-micro-batch-size | None | Type: int | ✅ | ✅ |
--pp-async-batch-depth | None | Type: int | ✅ | ✅ |
--stream-interval | 1 | Type: int | ✅ | ✅ |
--stream-output | False | bool flag (set to enable) | ✅ | ✅ |
--random-seed | None | Type: int | ✅ | ✅ |
--constrained-json-whitespace-pattern | None | Type: str | ✅ | ✅ |
--constrained-json-disable-any-whitespace | False | bool flag (set to enable) | ✅ | ✅ |
--watchdog-timeout | 300 | Type: float | ✅ | ✅ |
--soft-watchdog-timeout | 300 | Type: float | ✅ | ✅ |
--dist-timeout | None | Type: int | ✅ | ✅ |
--base-gpu-id | 0 | Type: int | ✅ | ✅ |
--gpu-id-step | 1 | Type: int | ✅ | ✅ |
--sleep-on-idle | False | bool flag (set to enable) | ✅ | ✅ |
--custom-sigquit-handler | None | Optional[Callable] | ✅ | ✅ |
Logging
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--log-level | info | Type: str | ✅ | ✅ | ❌ |
--log-level-http | None | Type: str | ✅ | ✅ | ❌ |
--log-requests | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--log-requests-level | 2 | 0, 1, 2, 3 | ✅ | ✅ | ❌ |
--log-requests-format | text | text, json | ✅ | ✅ | ❌ |
--crash-dump-folder | None | Type: str | ✅ | ✅ | ❌ |
--enable-metrics | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--enable-metrics-for-all-schedulers | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--tokenizer-metrics-custom-labels-header | x-custom-labels | Type: str | ✅ | ✅ | ❌ |
--tokenizer-metrics-allowed-custom-labels | None | List[str] | ✅ | ✅ | ❌ |
--bucket-time-to-first-token | None | List[float] | ✅ | ✅ | ❌ |
--bucket-inter-token-latency | None | List[float] | ✅ | ✅ | ❌ |
--bucket-e2e-request-latency | None | List[float] | ✅ | ✅ | ❌ |
--collect-tokens-histogram | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--prompt-tokens-buckets | None | List[str] | ✅ | ✅ | ❌ |
--generation-tokens-buckets | None | List[str] | ✅ | ✅ | ❌ |
--gc-warning-threshold-secs | 0.0 | Type: float | ✅ | ✅ | ❌ |
--decode-log-interval | 40 | Type: int | ✅ | ✅ | ❌ |
--enable-request-time-stats-logging | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--kv-events-config | None | Type: str | ❌ | ❌ | ✅ |
--enable-trace | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--oltp-traces-endpoint | localhost:4317 | Type: str | ✅ | ✅ | ❌ |
RequestMetricsExporter configuration
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--export-metrics-to-file | False | bool flag (set to enable) | ✅ | ✅ |
--export-metrics-to-file-dir | None | Type: str | ✅ | ✅ |
API related
Data parallelism
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--data-parallel-size, --dp-size | 1 | Type: int | ✅ | ✅ |
--load-balance-method | round_robin | round_robin,total_requests,total_tokens | ✅ | ✅ |
--prefill-round-robin-balance | False | bool flag (set to enable) | ✅ | ✅ |
Multi-node distributed serving
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--dist-init-addr, --nccl-init-addr | None | Type: str | ✅ | ✅ |
--nnodes | 1 | Type: int | ✅ | ✅ |
--node-rank | 0 | Type: int | ✅ | ✅ |
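
A multi-node launch runs one server process per node, with the same `--dist-init-addr` and `--nnodes` on every node and a distinct `--node-rank`. The following is a minimal sketch of generating those per-node commands, assuming the `python -m sglang.launch_server` entry point. The helper `node_commands`, the address, the model path, and the tensor-parallel size are placeholder values chosen for illustration.

```python
import shlex


def node_commands(model_path, init_addr, nnodes, tp_size):
    """Return one launch command string per node rank (hypothetical helper)."""
    commands = []
    for rank in range(nnodes):
        argv = [
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp-size", str(tp_size),
            "--dist-init-addr", init_addr,   # identical on every node
            "--nnodes", str(nnodes),         # identical on every node
            "--node-rank", str(rank),        # unique per node
        ]
        commands.append(shlex.join(argv))
    return commands


for cmd in node_commands("/models/example-model", "10.0.0.1:5000", nnodes=2, tp_size=16):
    print(cmd)
```
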
Model override args
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--json-model-override-args | {} | Type: str | ✅ | ✅ |
--preferred-sampling-params | None | Type: str | ✅ | ✅ |
LoRA
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--enable-lora | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--max-lora-rank | None | Type: int | ✅ | ✅ | ❌ |
--lora-target-modules | None | all | ✅ | ✅ | ❌ |
--lora-paths | None | Type: List[str] / JSON objects | ✅ | ✅ | ❌ |
--max-loras-per-batch | 8 | Type: int | ✅ | ✅ | ❌ |
--max-loaded-loras | None | Type: int | ✅ | ✅ | ❌ |
--lora-eviction-policy | lru | lru,fifo | ✅ | ✅ | ❌ |
--lora-backend | triton | triton | ✅ | ✅ | ❌ |
--max-lora-chunk-size | 16 | 16, 32, 64, 128 | ❌ | ❌ | ✅ |
Kernel Backends (Attention, Sampling, Grammar, GEMM)
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--attention-backend | None | ascend | ✅ | ✅ | ❌ |
--prefill-attention-backend | None | ascend | ✅ | ✅ | ❌ |
--decode-attention-backend | None | ascend | ✅ | ✅ | ❌ |
--sampling-backend | None | pytorch,ascend | ✅ | ✅ | ❌ |
--grammar-backend | None | xgrammar | ✅ | ✅ | ❌ |
--mm-attention-backend | None | ascend_attn | ✅ | ✅ | ❌ |
--nsa-prefill-backend | flashmla_sparse | flashmla_sparse,flashmla_decode,fa3,tilelang,aiter | ❌ | ❌ | ✅ |
--nsa-decode-backend | fa3 | flashmla_prefill,flashmla_kv,fa3,tilelang,aiter | ❌ | ❌ | ✅ |
--fp8-gemm-backend | auto | auto,deep_gemm,flashinfer_trtllm,cutlass,triton,aiter | ❌ | ❌ | ✅ |
--disable-flashinfer-autotune | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
Speculative decoding
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--speculative-algorithm | None | EAGLE3,NEXTN | ✅ | ✅ | ❌ |
--speculative-draft-model-path, --speculative-draft-model | None | Type: str | ✅ | ✅ | ❌ |
--speculative-draft-model-revision | None | Type: str | ✅ | ✅ | ❌ |
--speculative-draft-load-format | None | auto | ✅ | ✅ | ❌ |
--speculative-num-steps | None | Type: int | ✅ | ✅ | ❌ |
--speculative-eagle-topk | None | Type: int | ✅ | ✅ | ❌ |
--speculative-num-draft-tokens | None | Type: int | ✅ | ✅ | ❌ |
--speculative-accept-threshold-single | 1.0 | Type: float | ❌ | ❌ | ✅ |
--speculative-accept-threshold-acc | 1.0 | Type: float | ❌ | ❌ | ✅ |
--speculative-token-map | None | Type: str | ✅ | ✅ | ❌ |
--speculative-attention-mode | prefill | prefill,decode | ✅ | ✅ | ❌ |
--speculative-moe-runner-backend | None | auto | ✅ | ✅ | ❌ |
--speculative-moe-a2a-backend | None | ascend_fuseep | ✅ | ✅ | ❌ |
--speculative-draft-attention-backend | None | ascend | ✅ | ✅ | ❌ |
--speculative-draft-model-quantization | None | unquant | ✅ | ✅ | ❌ |
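
To make the Options column concrete, the sketch below mirrors a few speculative-decoding rows in a stand-in `argparse` parser: enumerated options become `choices`, and `Type: int` becomes `type=int`. This is an illustration only, not SGLang's own parser. The draft model path and the numeric values are arbitrary placeholders.

```python
import argparse


def make_parser():
    # Stand-in parser for illustration; not SGLang's own argument parser.
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--speculative-algorithm",
                        choices=["EAGLE3", "NEXTN"], default=None)
    parser.add_argument("--speculative-draft-model-path", type=str, default=None)
    parser.add_argument("--speculative-num-steps", type=int, default=None)
    parser.add_argument("--speculative-eagle-topk", type=int, default=None)
    parser.add_argument("--speculative-num-draft-tokens", type=int, default=None)
    return parser


args = make_parser().parse_args([
    "--speculative-algorithm", "EAGLE3",
    "--speculative-draft-model-path", "/models/example-draft",  # placeholder
    "--speculative-num-steps", "3",         # arbitrary example value
    "--speculative-eagle-topk", "1",        # arbitrary example value
    "--speculative-num-draft-tokens", "4",  # arbitrary example value
])
print(args.speculative_algorithm, args.speculative_num_draft_tokens)
```

A value outside the listed options, such as an algorithm name not in the table, is rejected by this stand-in parser at parse time.
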
Ngram speculative decoding
| Argument | Defaults | Options | A2 | A3 | Experimental |
|---|---|---|---|---|---|
--speculative-ngram-min-match-window-size | 1 | Type: int | ❌ | ❌ | ✅ |
--speculative-ngram-max-match-window-size | 12 | Type: int | ❌ | ❌ | ✅ |
--speculative-ngram-min-bfs-breadth | 1 | Type: int | ❌ | ❌ | ✅ |
--speculative-ngram-max-bfs-breadth | 10 | Type: int | ❌ | ❌ | ✅ |
--speculative-ngram-match-type | BFS | BFS,PROB | ❌ | ❌ | ✅ |
--speculative-ngram-branch-length | 18 | Type: int | ❌ | ❌ | ✅ |
--speculative-ngram-capacity | 10000000 | Type: int | ❌ | ❌ | ✅ |
Expert parallelism
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--expert-parallel-size, --ep-size, --ep | 1 | Type: int | ✅ | ✅ | ❌ |
--moe-a2a-backend | none | none,deepep,ascend_fuseep | ✅ | ✅ | ❌ |
--moe-runner-backend | auto | auto, triton | ✅ | ✅ | ❌ |
--flashinfer-mxfp4-moe-precision | default | default,bf16 | ❌ | ❌ | ✅ |
--enable-flashinfer-allreduce-fusion | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
--deepep-mode | auto | normal, low_latency, auto | ✅ | ✅ | ❌ |
--deepep-config | None | Type: str | ❌ | ❌ | ✅ |
--ep-num-redundant-experts | 0 | Type: int | ✅ | ✅ | ❌ |
--ep-dispatch-algorithm | None | Type: str | ✅ | ✅ | ❌ |
--init-expert-location | trivial | Type: str | ✅ | ✅ | ❌ |
--enable-eplb | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--eplb-algorithm | auto | Type: str | ✅ | ✅ | ❌ |
--eplb-rebalance-layers-per-chunk | None | Type: int | ✅ | ✅ | ❌ |
--eplb-min-rebalancing-utilization-threshold | 1.0 | Type: float | ✅ | ✅ | ❌ |
--expert-distribution-recorder-mode | None | Type: str | ✅ | ✅ | ❌ |
--expert-distribution-recorder-buffer-size | None | Type: int | ✅ | ✅ | ❌ |
--enable-expert-distribution-metrics | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--moe-dense-tp-size | None | Type: int | ✅ | ✅ | ❌ |
--elastic-ep-backend | None | none, mooncake | ❌ | ❌ | ✅ |
--mooncake-ib-device | None | Type: str | ❌ | ❌ | ✅ |
Mamba Cache
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--max-mamba-cache-size | None | Type: int | ✅ | ✅ |
--mamba-ssm-dtype | float32 | float32,bfloat16 | ✅ | ✅ |
--mamba-full-memory-ratio | 0.2 | Type: float | ✅ | ✅ |
--mamba-scheduler-strategy | auto | auto, no_buffer, extra_buffer | ✅ | ✅ |
--mamba-track-interval | 256 | Type: int | ✅ | ✅ |
Hierarchical cache
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--enable-hierarchical-cache | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--hicache-ratio | 2.0 | Type: float | ✅ | ✅ | ❌ |
--hicache-size | 0 | Type: int | ✅ | ✅ | ❌ |
--hicache-write-policy | write_through | write_back,write_through,write_through_selective | ✅ | ✅ | ❌ |
--radix-eviction-policy | lru | lru, lfu | ✅ | ✅ | ❌ |
--hicache-io-backend | kernel | kernel_ascend,direct | ✅ | ✅ | ❌ |
--hicache-mem-layout | layer_first | page_first_direct,page_first_kv_split | ✅ | ✅ | ❌ |
--hicache-storage-backend | None | file | ✅ | ✅ | ❌ |
--hicache-storage-prefetch-policy | best_effort | best_effort,wait_complete,timeout | ❌ | ❌ | ✅ |
--hicache-storage-backend-extra-config | None | Type: str | ❌ | ❌ | ✅ |
LMCache
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--enable-lmcache | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
Offloading
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--cpu-offload-gb | 0 | Type: int | ✅ | ✅ |
--offload-group-size | -1 | Type: int | ✅ | ✅ |
--offload-num-in-group | 1 | Type: int | ✅ | ✅ |
--offload-prefetch-step | 1 | Type: int | ✅ | ✅ |
--offload-mode | cpu | Type: str | ✅ | ✅ |
Args for multi-item scoring
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--multi-item-scoring-delimiter | None | Type: int | ✅ | ✅ |
Optimization/debug options
| Argument | Defaults | Options | A2 | A3 | Special | Planned |
|---|---|---|---|---|---|---|
--disable-radix-cache | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--cuda-graph-max-bs | None | Type: int | ✅ | ✅ | ❌ | ❌ |
--cuda-graph-bs | None | List[int] | ✅ | ✅ | ❌ | ❌ |
--disable-cuda-graph | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-cuda-graph-padding | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-profile-cuda-graph | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-cudagraph-gc | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-nccl-nvls | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--enable-symm-mem | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--disable-flashinfer-cutlass-moe-fp4-allgather | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--enable-tokenizer-batch-encode | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-tokenizer-batch-encode | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-outlines-disk-cache | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-custom-all-reduce | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-mscclpp | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--enable-torch-symm-mem | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--disable-overlap-schedule | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-mixed-chunk | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-dp-attention | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-dp-lm-head | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-two-batch-overlap | False | bool flag (set to enable) | ❌ | ❌ | ❌ | ✅ |
--enable-single-batch-overlap | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--tbo-token-distribution-threshold | 0.48 | Type: float | ❌ | ❌ | ❌ | ✅ |
--enable-torch-compile | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-torch-compile-debug-mode | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-piecewise-cuda-graph | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--piecewise-cuda-graph-tokens | None | Type: JSON list | ✅ | ✅ | ❌ | ❌ |
--piecewise-cuda-graph-compiler | eager | eager, inductor | ✅ | ✅ | ❌ | ❌ |
--torch-compile-max-bs | 32 | Type: int | ✅ | ✅ | ❌ | ❌ |
--piecewise-cuda-graph-max-tokens | 4096 | Type: int | ✅ | ✅ | ❌ | ❌ |
--torchao-config | "" | Type: str | ❌ | ❌ | ✅ | ❌ |
--enable-nan-detection | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-p2p-check | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--triton-attention-reduce-in-fp32 | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--triton-attention-num-kv-splits | 8 | Type: int | ❌ | ❌ | ✅ | ❌ |
--triton-attention-split-tile-size | None | Type: int | ❌ | ❌ | ✅ | ❌ |
--delete-ckpt-after-loading | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-memory-saver | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-weights-cpu-backup | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-draft-weights-cpu-backup | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--allow-auto-truncate | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-custom-logit-processor | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--flashinfer-mla-disable-ragged | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--disable-shared-experts-fusion | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-chunked-prefix-cache | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--disable-fast-image-processor | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--keep-mm-feature-on-device | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-return-hidden-states | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-return-routed-experts | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--scheduler-recv-interval | 1 | Type: int | ✅ | ✅ | ❌ | ❌ |
--numa-node | None | List[int] | ✅ | ✅ | ❌ | ❌ |
--rl-on-policy-target | None | fsdp | ❌ | ❌ | ❌ | ✅ |
--enable-layerwise-nvtx-marker | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
--enable-attn-tp-input-scattered | False | bool flag (set to enable) | ❌ | ❌ | ❌ | ❌ |
--enable-nsa-prefill-context-parallel | False | bool flag (set to enable) | ✅ | ✅ | ❌ | ❌ |
--enable-fused-qk-norm-rope | False | bool flag (set to enable) | ❌ | ❌ | ✅ | ❌ |
Dynamic batch tokenizer
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--enable-dynamic-batch-tokenizer | False | bool flag (set to enable) | ✅ | ✅ |
--dynamic-batch-tokenizer-batch-size | 32 | Type: int | ✅ | ✅ |
--dynamic-batch-tokenizer-batch-timeout | 0.002 | Type: float | ✅ | ✅ |
Debug tensor dumps
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--debug-tensor-dump-output-folder | None | Type: str | ✅ | ✅ |
--debug-tensor-dump-layers | None | List[int] | ✅ | ✅ |
--debug-tensor-dump-input-file | None | Type: str | ✅ | ✅ |
PD disaggregation
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--disaggregation-mode | null | null,prefill,decode | ✅ | ✅ | ❌ |
--disaggregation-transfer-backend | mooncake | ascend | ✅ | ✅ | ❌ |
--disaggregation-bootstrap-port | 8998 | Type: int | ✅ | ✅ | ❌ |
--disaggregation-decode-tp | None | Type: int | ✅ | ✅ | ❌ |
--disaggregation-decode-dp | None | Type: int | ✅ | ✅ | ❌ |
--disaggregation-ib-device | None | Type: str | ❌ | ❌ | ✅ |
--disaggregation-decode-enable-offload-kvcache | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--disaggregation-decode-enable-fake-auto | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--num-reserved-decode-tokens | 512 | Type: int | ✅ | ✅ | ❌ |
--disaggregation-decode-polling-interval | 1 | Type: int | ✅ | ✅ | ❌ |
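
With PD disaggregation, prefill and decode run as separate server instances that differ in `--disaggregation-mode`. The sketch below models the two roles with a small dataclass and produces a command line for each, assuming the `python -m sglang.launch_server` entry point. The `PDInstance` class, ports, and model path are hypothetical; the flags and the `ascend` transfer backend come from the table above.

```python
from dataclasses import dataclass


@dataclass
class PDInstance:
    mode: str  # "prefill" or "decode"
    port: int

    def argv(self, model_path):
        """Build the launch argv for this role (illustrative only)."""
        if self.mode not in ("prefill", "decode"):
            raise ValueError(f"unexpected mode: {self.mode}")
        return [
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--port", str(self.port),
            "--disaggregation-mode", self.mode,
            "--disaggregation-transfer-backend", "ascend",
        ]


prefill = PDInstance("prefill", port=30000)
decode = PDInstance("decode", port=30001)
for inst in (prefill, decode):
    print(" ".join(inst.argv("/models/example-model")))
```
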
Encode prefill disaggregation
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--encoder-only | False | bool flag (set to enable) | ✅ | ✅ |
--language-only | False | bool flag (set to enable) | ✅ | ✅ |
--encoder-transfer-backend | zmq_to_scheduler | zmq_to_scheduler, zmq_to_tokenizer, mooncake | ✅ | ✅ |
--encoder-urls | [] | List[str] | ✅ | ✅ |
Custom weight loader
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--custom-weight-loader | None | List[str] | ✅ | ✅ | ❌ |
--weight-loader-disable-mmap | False | bool flag (set to enable) | ✅ | ✅ | ❌ |
--remote-instance-weight-loader-seed-instance-ip | None | Type: str | ✅ | ✅ | ❌ |
--remote-instance-weight-loader-seed-instance-service-port | None | Type: int | ✅ | ✅ | ❌ |
--remote-instance-weight-loader-send-weights-group-ports | None | Type: JSON list | ✅ | ✅ | ❌ |
--remote-instance-weight-loader-backend | nccl | transfer_engine, nccl | ✅ | ✅ | ❌ |
--remote-instance-weight-loader-start-seed-via-transfer-engine | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
For PD-Multiplexing
| Argument | Defaults | Options | A2 | A3 | Special |
|---|---|---|---|---|---|
--enable-pdmux | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
--pdmux-config-path | None | Type: str | ❌ | ❌ | ✅ |
--sm-group-num | 8 | Type: int | ❌ | ❌ | ✅ |
For Multi-Modal
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--mm-max-concurrent-calls | 32 | Type: int | ✅ | ✅ |
--mm-per-request-timeout | 10.0 | Type: float | ✅ | ✅ |
--enable-broadcast-mm-inputs-process | False | bool flag (set to enable) | ✅ | ✅ |
--mm-process-config | None | Type: JSON / Dict | ✅ | ✅ |
--mm-enable-dp-encoder | False | bool flag (set to enable) | ✅ | ✅ |
--limit-mm-data-per-request | None | Type: JSON / Dict | ✅ | ✅ |
For checkpoint decryption
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--decrypted-config-file | None | Type: str | ✅ | ✅ |
--decrypted-draft-config-file | None | Type: str | ✅ | ✅ |
--enable-prefix-mm-cache | False | bool flag (set to enable) | ✅ | ✅ |
For deterministic inference
| Argument | Defaults | Options | A2 | A3 | Planned |
|---|---|---|---|---|---|
--enable-deterministic-inference | False | bool flag (set to enable) | ❌ | ❌ | ✅ |
For registering hooks
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--forward-hooks | None | Type: JSON list | ✅ | ✅ |
Configuration file support
| Argument | Defaults | Options | A2 | A3 |
|---|---|---|---|---|
--config | None | Type: str | ✅ | ✅ |
Other Params
The following parameters are not supported because the third-party components they depend on, such as KTransformers and checkpoint-engine, are not compatible with the NPU.

| Argument | Defaults | Options |
|---|---|---|
--checkpoint-engine-wait-weights-before-ready | False | bool flag (set to enable) |
--kt-weight-path | None | Type: str |
--kt-method | AMXINT4 | Type: str |
--kt-cpuinfer | None | Type: int |
--kt-threadpool-count | 2 | Type: int |
--kt-num-gpu-experts | None | Type: int |
--kt-max-deferred-experts-per-token | None | Type: int |

The following parameters have some functional deficiencies in the community version.

| Argument | Defaults | Options |
|---|---|---|
--enable-double-sparsity | False | bool flag (set to enable) |
--ds-channel-config-path | None | Type: str |
--ds-heavy-channel-num | 32 | Type: int |
--ds-heavy-token-num | 256 | Type: int |
--ds-heavy-channel-type | qk | Type: str |
--ds-sparse-decode-threshold | 4096 | Type: int |
--tool-server | None | Type: str |
